Last updated: 2020-10-01
This part of the study will analyse differences between our Indonesian samples and compare them with healthy European controls. From my previous analysis of healthy Indonesians, I showed that there is a high presence of Plasmodium, Flavivirus, and Bacteria, however I want to see how different this is to populations in more sterile environments.
The aim of this analysis is therefore to test whether the signature in our Indonesian samples is unique. We will test this by looking at the relative abundance of taxa, grouping by PCA, hierarchical clustering, differential abundance testing, and alpha and beta diversity estimates.
The data in this study were generated from the previously-published study (cite Natri et al).. Briefly, …. 101-bp data on an Illumina..
it was run through KMA, CCMetage, etc…
Here is the rough break down of the number of sequences obtained:
The control samples were taken from Loohuis et al.
The code below will install the packages needed to run the analyses. We’re also setting up our directories to run locally, but I’ve provided all of the paths so they can be easily modified if they need to be run on the server.
require(ggplot2)
require(RColorBrewer)
library(dplyr)
library(plyr)
library(reshape2)
library(ggpubr)
library(metacoder)
library(tidyverse)
library(phyloseq)
library(DESeq2)
library(microbiome)
library(vegan)
library(picante)
library(ALDEx2)
library(metagenomeSeq)
library(HMP)
library(dendextend)
library(selbal)
library(rms)
library(breakaway)
library(microbiomeutilities)
library(mixOmics)
library(SRS)
# set up directories
inputdir = "/Users/katalinabobowik/Documents/UniMelb_PhD/Analysis/UniMelb_Sumba/ReferenceFiles/EpiStudy/"
AllReadsdir = "/Users/katalinabobowik/Documents/UniMelb_PhD/Analysis/UniMelb_Sumba/Output/Epi_Study/All_Reads/"
Indo250Kdir = "/Users/katalinabobowik/Documents/UniMelb_PhD/Analysis/UniMelb_Sumba/Output/Epi_Study/Indo_250K/"
outputdir = "/Users/katalinabobowik/Documents/UniMelb_PhD/Analysis/UniMelb_Sumba/Output/Epi_Study/ControlSampleComparison/"
refdir = "/Users/katalinabobowik/Documents/UniMelb_PhD/Analysis/UniMelb_Sumba/ReferenceFiles/EpiStudy/"
We’ll also set up our colour schemes.
# set ggplot colour theme to white
theme_set(theme_bw())
# Set up colour scheme
IndonesiaCol="#4477AA"
NetherlandsCol="#EE6677"
Many microbiome studies use the package phyloseq to analyse data due to its comprehensive packages. The data structures in Phyloseq (taxa data, otu data, and sample data) are also contained into a single object, which makes it easy to keep everything together.
First, we’ll read in our Indonesian single ended count data and separate taxa information from read abundance information. We’ll then assign sample information to the data (namely, population and sample IDs) so that we can easily compare it to the control data.
AllREadsSE_Indo_Counts <- read.csv(paste0(refdir,"Counts_Indo_AllReads_SE.csv"),check.names=FALSE)
# Separate species' abundances and taxonomy columns
taxa_raw <- as.matrix(AllREadsSE_Indo_Counts[,c("Superkingdom","Kingdom","Phylum", "Class", "Order","Family","Genus","Species")])
abund_raw <- as.matrix(AllREadsSE_Indo_Counts[,-which(colnames(AllREadsSE_Indo_Counts) %in% c("Superkingdom","Kingdom","Phylum", "Class", "Order","Family","Genus","Species"))])
# convert to Phyloseq object
tax = tax_table(taxa_raw)
taxa = otu_table(abund_raw, taxa_are_rows = TRUE)
AllREadsSE_Indo_Counts_physeq = phyloseq(taxa, tax)
# add in sample information, starting with Island
samplenames <- colnames(otu_table(AllREadsSE_Indo_Counts_physeq))
pop <- rep("Indonesia",ncol(otu_table(AllREadsSE_Indo_Counts_physeq)))
# make this into a df and add to the Phloseq object
samples_df=data.frame(SampleName=colnames(otu_table(AllREadsSE_Indo_Counts_physeq)), SamplePop=pop)
samples = sample_data(samples_df)
rownames(samples)=samples$SampleName
sample_data(AllREadsSE_Indo_Counts_physeq) <- samples
# get information on phyloseq objects
AllREadsSE_Indo_Counts_physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2863 taxa and 123 samples ]
## sample_data() Sample Data: [ 123 samples by 2 sample variables ]
## tax_table() Taxonomy Table: [ 2863 taxa by 8 taxonomic ranks ]
For the Indonesian samples, we can see that we have a phyloseq object consisting of 2,863 taxa with 123 samples, 6 of which are replicates. We’ll take the replicate with the highest library depth and then remove the rest.
# get replicates
trimmed_samplenames = gsub("Batch1",'',samplenames) %>% gsub("Batch2",'', .) %>% gsub("Batch3",'', .)
trimmed_samplenames = sub("([A-Z]{3})([0-9]{3})", "\\1-\\2", trimmed_samplenames)
replicate_index = which(duplicated(trimmed_samplenames) | duplicated(trimmed_samplenames, fromLast = TRUE))
replicates = samplenames[replicate_index]
# add sequencing depth information to the Physeq object in order to filter replicates by seqDepth
SeqDepth = colSums(otu_table(AllREadsSE_Indo_Counts_physeq))
sample_data(AllREadsSE_Indo_Counts_physeq)$SeqDepth = SeqDepth
# find out which replicates have the highest sequencing depth
sample_data(AllREadsSE_Indo_Counts_physeq)[replicates,]
## SampleName SamplePop SeqDepth
## MPI-381Batch1 MPI-381Batch1 Indonesia 18220
## MPI-381Batch3 MPI-381Batch3 Indonesia 21685
## MTW-TLL-013Batch2 MTW-TLL-013Batch2 Indonesia 18358
## MTW-TLL013Batch3 MTW-TLL013Batch3 Indonesia 21037
## SMB-ANK-016Batch2 SMB-ANK-016Batch2 Indonesia 19738
## SMB-ANK-027Batch1 SMB-ANK-027Batch1 Indonesia 18822
## SMB-ANK-027Batch2 SMB-ANK-027Batch2 Indonesia 8689
## SMB-ANK016Batch3 SMB-ANK016Batch3 Indonesia 14552
## SMB-ANK027Batch3 SMB-ANK027Batch3 Indonesia 17962
## SMB-WNG-021Batch1 SMB-WNG-021Batch1 Indonesia 20937
## SMB-WNG021Batch3 SMB-WNG021Batch3 Indonesia 18590
replicateDF=as.data.frame(sample_data(AllREadsSE_Indo_Counts_physeq)[replicates,])
replicateDF$SampleName = sub("([A-Z]{3})([0-9]{3})", "\\1-\\2", replicateDF$SampleName)
replicateDF$SampleName = gsub("Batch1",'',replicateDF$SampleName) %>% gsub("Batch2",'', .) %>% gsub("Batch3",'', .)
replicateDF=replicateDF[with(replicateDF, order(-SeqDepth)), ]
removeReplicates=rownames(replicateDF[which(duplicated(replicateDF$SampleName)),])
keepReplicates=rownames(sample_data(AllREadsSE_Indo_Counts_physeq))[-which(rownames(sample_data(AllREadsSE_Indo_Counts_physeq)) %in% removeReplicates)]
# prune these out
AllREadsSE_Indo_Counts_physeq=prune_samples(keepReplicates,AllREadsSE_Indo_Counts_physeq)
AllREadsSE_Indo_Counts_physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2863 taxa and 117 samples ]
## sample_data() Sample Data: [ 117 samples by 3 sample variables ]
## tax_table() Taxonomy Table: [ 2863 taxa by 8 taxonomic ranks ]
The next thing to do is to set up the control samples. Like we did for the Indonesia samples, we need to convert the data into a phyloseq object, then add in sample information.
control_100K_Counts <- read.csv(paste0(refdir,"Controls_NoFiltering_Counts.csv"),check.names=FALSE)
# Separate species' abundances and taxonomy columns
taxa_raw <- as.matrix(control_100K_Counts[,c("Superkingdom","Kingdom","Phylum", "Class", "Order","Family","Genus","Species")])
abund_raw <- as.matrix(control_100K_Counts[,-which(colnames(control_100K_Counts) %in% c("Superkingdom","Kingdom","Phylum", "Class", "Order","Family","Genus","Species"))])
# convert to Phyloseq object
tax = tax_table(taxa_raw)
taxa = otu_table(abund_raw, taxa_are_rows = TRUE)
control_100K_Counts_physeq = phyloseq(taxa, tax)
# get information on phyloseq object
control_100K_Counts_physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 971 taxa and 192 samples ]
## tax_table() Taxonomy Table: [ 971 taxa by 8 taxonomic ranks ]
The control dataset has sample information from individuals with mental illness, as well as controls. Since we only want to compare our healthy Indonesian individuals to healthy controls, we’ll remove all samples with mental illlness from the phyloseq object.
# text file made from script 'GetDiseaseStatus_ControlSamples.sh'
diseaseStatus=read.table(paste0(refdir,"ControlSamples_DiseaseStatus.txt"))
colnames(diseaseStatus)=c("Samples","diseaseStatus")
controlSamples=as.character(diseaseStatus[grep("CSF", diseaseStatus$diseaseStatus),"Samples"])
# prune out samples we don't want
control_100K_Counts_physeq=prune_samples(controlSamples,control_100K_Counts_physeq)
# add in sample information, i.e., the sample names and population they're from
samplenames <- colnames(otu_table(control_100K_Counts_physeq))
pop <- rep("Netherlands",ncol(otu_table(control_100K_Counts_physeq)))
# make this into a df and add to the Phloseq object
samples_df=data.frame(SampleName=colnames(otu_table(control_100K_Counts_physeq)), SamplePop=pop)
samples = sample_data(samples_df)
rownames(samples)=samples$SampleName
sample_data(control_100K_Counts_physeq) <- samples
# get phyloseq summary information
control_100K_Counts_physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 971 taxa and 49 samples ]
## sample_data() Sample Data: [ 49 samples by 2 sample variables ]
## tax_table() Taxonomy Table: [ 971 taxa by 8 taxonomic ranks ]
We now have a phyloseq object of 49 samples with 471 taxa.
The next step is to merge both the control and Indonesian data together. Phyloseq makes this really easy by simply using the merge_phyloseq() function. We’ll then get some summary stats on the data.
merged_phylo_counts=merge_phyloseq(AllREadsSE_Indo_Counts_physeq, control_100K_Counts_physeq)
merged_phylo_counts
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2863 taxa and 166 samples ]
## sample_data() Sample Data: [ 166 samples by 3 sample variables ]
## tax_table() Taxonomy Table: [ 2863 taxa by 8 taxonomic ranks ]
# summarise the dataset
summarize_phyloseq(merged_phylo_counts)
## [[1]]
## [1] "1] Min. number of reads = 905"
##
## [[2]]
## [1] "2] Max. number of reads = 419066"
##
## [[3]]
## [1] "3] Total number of reads = 2945114"
##
## [[4]]
## [1] "4] Average number of reads = 17741.6506024096"
##
## [[5]]
## [1] "5] Median number of reads = 13813.5"
##
## [[6]]
## [1] "7] Sparsity = 0.963026398293138"
##
## [[7]]
## [1] "6] Any OTU sum to 1 or less? YES"
##
## [[8]]
## [1] "8] Number of singletons = 610"
##
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n (i.e. exactly one read detected across all samples)17.8833391547328"
##
## [[10]]
## [1] "10] Number of sample variables are: 3"
##
## [[11]]
## [1] "SampleName" "SamplePop" "SeqDepth"
# Finally, we'll visualise library sizes before any data preprocessing. To do this, we need to assign sequencing depth to the sample data
SeqDepth = colSums(otu_table(merged_phylo_counts))
sample_data(merged_phylo_counts)$SeqDepth = SeqDepth
# make a barplot of library sizes
ggplot(meta(merged_phylo_counts), aes(SampleName, SeqDepth)) + geom_bar(stat = "identity", aes(fill = SamplePop)) +
scale_fill_manual(values = c(IndonesiaCol,NetherlandsCol)) + rotate_x_text()
The easiest way to get rid of some error in your data is to throw out any count information below some threshold. Oddly, in microbiomics, there’s no set thresholding for this. In the end, it’s really a comporomise between accuracy and keeping rare taxa. What IS decided is that filtering out at least singletons is standard, since these are regarded as sources of error or contamination. Some resources I found that were helpful on this can be found here and here.
In my own data, we can see that there’s a high number of low counts within the data.
SeqDepth = colSums(otu_table(merged_phylo_counts))
sample_data(merged_phylo_counts)$SeqDepth = SeqDepth
# histogram of data
ggplot(meta(merged_phylo_counts)) + geom_histogram(aes(x = SeqDepth), alpha= 0.6, bins=100) + facet_wrap(~SamplePop)
With a higher sequencing depth, you can afford to play around with the thresholding, however for my library size, the sequencing depth is varaible and quite low in some samples. Therefore, pushing this threshold up too high will eliminate rare taxa, especially given that we didn’t have a high library size to begin with.
Although our starting library size is smalol, let’s explore the data a bit by looking at the effect of removing singletons.
A great tool to do this is raraefaction curves. Rarefaction curves are commonly used in microbiomics to estimate 1) species richness and 2) determine how extensively a library was sampled. For the first point, it’s nearly impossible to capture all species within a community, and therefore this allows for a way to estimate the total species we would expect to find by extrapolating the curve of the rarefaction plot. A rarefaction curve will (if sampled to a high enough depth) have an asymptote, and this is regarded as th eestimate of the total number of species within that community.
For the second point, we can also use the asymptote of the curve to see how extensively we sampled. If the curve has not started to flatten off, we have not captured everything.
Lets see how rarefaction looks like when we remove singletons and when we remove 5 counts.
# Separate species' abundances and taxonomy columns
rarecurve_counts <- otu_table(merged_phylo_counts)
# Try with different filtering thresholds:
for (i in c(1,5)){
rarecurve_counts[rarecurve_counts<=i]<-0
rarecurve(t(otu_table(rarecurve_counts, taxa_are_rows = TRUE)), step=20, col=c(rep(IndonesiaCol,sum(sample_data(merged_phylo_counts)[,"SamplePop"]=="Indonesia")),rep(NetherlandsCol,sum(sample_data(merged_phylo_counts)[,"SamplePop"]=="Netherlands"))),label=F, xlab="Counts",ylab="Number Species",main=paste("Removing Reads",i,"And Below",sep=" "),xlim=c(0,50000))
}
From the rarefaction curves, we can notice a few things. For one, the number of reads needed to capture all of the diversity in the European reads is much lower. You can see that it asympotes at around 1000, whereas for the Indonesian reads, the asymtote ranges from around 5,000 to much higher for a handful of other samples (all of these samples with the asymptote that’s not captured in the plot are sampled with detected Plasmodium in their blood).
Let’s zoom in a bit to see that.
# Separate species' abundances and taxonomy columns
rarecurve_counts <- otu_table(merged_phylo_counts)
# Try with different filtering thresholds:
for (i in c(1,5)){
rarecurve_counts[rarecurve_counts<=i]<-0
rarecurve(t(otu_table(rarecurve_counts, taxa_are_rows = TRUE)), step=20, col=c(rep(IndonesiaCol,sum(sample_data(merged_phylo_counts)[,"SamplePop"]=="Indonesia")),rep(NetherlandsCol,sum(sample_data(merged_phylo_counts)[,"SamplePop"]=="Netherlands"))),label=F, xlab="Counts",ylab="Number Species",main=paste("Removing Reads",i,"And Below",sep=" "),xlim=c(0,5000))
}
We also notice that there’s also a lower sample size in the Europeans, so it’s not entirely a fair comparison but we’ll be fixing that up later on.
We can also see on the right hand side (with a thresholding of removing 5 singletons)that the curve asymptotes much sooner. This is beacuse you need more reads to detect rare species.
As mentioned before, because our starting read depth is small, we will stick with removing singletons. We will also make a copy of the unfiltered data for downstream use.
# Filter out singletons
merged_phylo_counts_unfiltered=merged_phylo_counts
otu_table(merged_phylo_counts)[otu_table(merged_phylo_counts)<=1]<-0
# filter out any 0's that remain in the dataframe
any(taxa_sums(merged_phylo_counts) == 0)
## [1] TRUE
# TRUE
merged_phylo_counts <- prune_taxa(taxa_sums(merged_phylo_counts) > 0, merged_phylo_counts)
# summarise the phyloseq dataset
summarize_phyloseq(merged_phylo_counts)
## [[1]]
## [1] "1] Min. number of reads = 903"
##
## [[2]]
## [1] "2] Max. number of reads = 419027"
##
## [[3]]
## [1] "3] Total number of reads = 2940063"
##
## [[4]]
## [1] "4] Average number of reads = 17711.2228915663"
##
## [[5]]
## [1] "5] Median number of reads = 13775.5"
##
## [[6]]
## [1] "7] Sparsity = 0.962248392971285"
##
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
##
## [[8]]
## [1] "8] Number of singletons = 0"
##
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n (i.e. exactly one read detected across all samples)0"
##
## [[10]]
## [1] "10] Number of sample variables are: 3"
##
## [[11]]
## [1] "SampleName" "SamplePop" "SeqDepth"
Now let’s plot the library size and the number of OTUs for each sample after removing singletons.
# barplot of library sizes
ggplot(meta(merged_phylo_counts), aes(SampleName, SeqDepth)) + geom_bar(stat = "identity", aes(fill = SamplePop)) +
scale_fill_manual(values = c(IndonesiaCol,NetherlandsCol)) + rotate_x_text()
# get barplot of total counts per individual
nOTUs = colSums(otu_table(merged_phylo_counts)!=0)
sample_data(merged_phylo_counts)$nOTUs = nOTUs
# barplot of OTUs
ggplot(meta(merged_phylo_counts), aes(SampleName, nOTUs)) + geom_bar(stat = "identity", aes(fill = SamplePop)) +
scale_fill_manual(values = c(IndonesiaCol,NetherlandsCol)) + rotate_x_text()
We can see that the Indonesian samples have a much higher library size and also a higher number of OTUs within each individual.
You can also see that some Indonesian samples have a much higher read depth than other samples. These are all samples which were previously confirmed to have Plasmodium (in my second study).
From the script X, we saw that human reads and viridiplantae are not of interest because of X. We’ll filter these out.
merged_phylo_counts=subset_taxa(merged_phylo_counts, (Kingdom!="Viridiplantae"))
merged_phylo_counts=subset_taxa(merged_phylo_counts, (Kingdom!="Metazoa"))
# summarise data
merged_phylo_counts
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1492 taxa and 166 samples ]
## sample_data() Sample Data: [ 166 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 1492 taxa by 8 taxonomic ranks ]
The library sizes between samples and groups is highly variable, and therefore comparing the data to each other will result in biased results.
There are two ways of handling this: 1. Performing a transformation of the data 2. Rarefying the data
Taxa can be viewed by their relative abundance, however changes in the abundance of one taxon will result in changing the abundance of other taxa.
One of the ways to handle this is to transform the data using Centered Log Ratio (CLR)transformation. CLR data shows how OTUs behave relative to the per-sample average and is a commonly-used data transformation method in microbiomics.
Another cool thing about using CLR-transformed data is that you they are not affected by sequencing depth. This is an excerpt explaining from a paper by Gloor et al explaining that:
“The clr-transformed values are scale-invariant; that is the same ratio is expected to be obtained in a sample with few read counts or an identical sample with many read counts, only the precision of the clr estimate is affected. This is elaborated in the “Probability” and “Log-ratio transformations” section in the Supplement, but the consequence is that count normalization is unnecessary and indeed, undesirable since information on precision is lost."
Unfortunately, one of the disadvantages to using CLR-transformed data is that it can’t be used in diversity estimates, and it’s also hard to visualise.
Becasue CLR data is an informative measure of our data, I’ll first explore sample grouping using this method.
The first step to performing a CLR transformation on the data is to add an offset of 1 to the counts. This is necessary, since performing a log on 0 values is undefined. We’ll then perform a log ratio transformation of the data using the mixOmics package.
offset_otu=otu_table(merged_phylo_counts)+1
transform_counts=t(otu_table(offset_otu))
data_clr <- logratio.transfo(as.matrix(transform_counts), logratio = 'CLR', offset = 0)
Now that we’ve transformed our data, we can make a PCA plot to see how each sample clusters. The current obect we have is a CLR-class object. You can plot this type of data object easily with miOmics, however I prefer the visualisation that phyloseq offers (you can’t alter the PCA plots that much in mixOmics). So, we’ll turn the clr object back into a phyloseq object and make an ordination plot of the data.
!Note: When Euclidean distances are used in PCoA plots, it is equivalent to a PCA plot.
# Make a duplicated phyloseq object to use for plotting
class(data_clr)="matrix"
## Warning in class(data_clr) = "matrix": Setting class(x) to "matrix" sets
## attribute to NULL; result will no longer be an S4 object
taxa = otu_table(t(data_clr), taxa_are_rows = TRUE)
merged_phylo_counts_clr=merged_phylo_counts
otu_table(merged_phylo_counts_clr)=taxa
out.wuf.log <- ordinate(merged_phylo_counts_clr, method = "PCoA", distance = "euclidean")
plot_ordination(merged_phylo_counts_clr, out.wuf.log, color="SamplePop", axes = 1:2, label="SampleName") + scale_colour_manual(values=c(IndonesiaCol,NetherlandsCol))
We can see that the first principal component is separating the two studies, while the second principal component separates out the malaria samples. I’ll highlight the samples by Plasmodium abundance to demonstrate that point.
# counts are on a log scale since there's high variation between samples (makes for easier colour visualisation)
logged_phyla_counts = log10(colSums(otu_table(merged_phylo_counts)[grep("Apicomplexa",tax_table(merged_phylo_counts)[,"Phylum"])])+1)
sample_data(merged_phylo_counts_clr)[["Apicomplexa"]] = logged_phyla_counts
out.wuf.log <- ordinate(merged_phylo_counts_clr, method = "PCoA", distance = "euclidean")
plot_ordination(merged_phylo_counts_clr, out.wuf.log, color="Apicomplexa", axes = 1:2, label="SampleName")
If we go down to PC3 versus PC4, we can see the European study separating along the fourth PC.
plot_ordination(merged_phylo_counts_clr, out.wuf.log, color="SamplePop", axes = 3:4, label="SampleName") + scale_colour_manual(values=c(IndonesiaCol,NetherlandsCol))
One of the most common ways to normalise data is through rarefying, or subsampling, the entire library to the lowest read depth in the dataset. Rarefaction is performed by drawing reads without replacement from each sample so that all samples have the same number of total counts. This process standardizes the library size across samples and is especially important for calulating diversity metrics, where read depth influences microbe diversity.
From what I’ve found, many people in the microbiomics community have very strong opinions about rarefying data. Some camps, such as QIIME, think that it’s just fine, whereas others, such as the creators of Phyloseq, strongly advise against it. A seminal, and of the most well-referenced papers about the disadvantages of rarefying data, was published in 2014 by McMurdie & Holmes. The arguments they have echo the main concerns seen within the ‘anti-rarefy’ camps: namely that rarefying data 1) decreases the ability to detect rare taxa and 2) can lead to unequally-rarefied data due to rare taxa being over or underrepresented in libraries normalised to a small size.
Although some authors argue that rarefaction is, in fact, a perfectly suitable option (particularly for small library sizes), I do think this is a valid point.
Recently, a package called SRS (standing for Scaling with Ranked Subsampling) came out in R and it preserves OTU frequencies by 1) scaling counts by a constant factor where the sum of the scaled counts equals the minimum library size chosen by the user and 2) performing a ranked subsampling on the data.
For this study, I chose Cmin (i.e., the number of counts to which all samples will be normalized) to be my minimum library size in the control dataset, which is 903 counts.
SRS won’t work if we have samples with a library size under this threshold, so let’s remove all samples under 903, then perform SRS.
minControl=min(sample_sums(phyloseq::subset_samples(merged_phylo_counts, SamplePop == "Netherlands")))
keep=names(which(sample_sums(merged_phylo_counts)>=minControl))
pruned_data=prune_samples(keep, merged_phylo_counts)
any(taxa_sums(pruned_data) == 0)
## [1] TRUE
# TRUE
pruned_data <- prune_taxa(taxa_sums(pruned_data) > 0, pruned_data)
any(taxa_sums(pruned_data) == 0)
## [1] FALSE
# FALSE
pruned_data_df=as.data.frame(otu_table(pruned_data))
SRS=SRS(pruned_data_df,minControl)
rownames(SRS)=rownames(pruned_data_df)
# transform back into phyloseq object
taxa = otu_table(SRS, taxa_are_rows = TRUE)
otu_table(pruned_data)=taxa
any(taxa_sums(pruned_data) == 0)
## [1] TRUE
# TRUE
pruned_data <- prune_taxa(taxa_sums(pruned_data) > 0, pruned_data)
any(taxa_sums(pruned_data) == 0)
## [1] FALSE
# FALSE
Now let’s make sure the library sizes are the same and see how the OTU numbers look like between datasets.
SeqDepthPruned = sample_sums(pruned_data)
sample_data(pruned_data)$SeqDepthPruned = SeqDepthPruned
# barplot of library sizes
ggplot(meta(pruned_data), aes(SampleName, SeqDepthPruned)) + geom_bar(stat = "identity", aes(fill = SamplePop)) +
scale_fill_manual(values = c(IndonesiaCol,NetherlandsCol)) + rotate_x_text()
# get barplot of total counts per individual
nOTUs = colSums(otu_table(pruned_data)!=0)
sample_data(pruned_data)$nOTUs = nOTUs
# barplot of OTUs
ggplot(meta(pruned_data), aes(SampleName, nOTUs)) + geom_bar(stat = "identity", aes(fill = SamplePop)) +
scale_fill_manual(values = c(IndonesiaCol,NetherlandsCol)) + rotate_x_text()
## Exploratory plots after rarefaction
Now that we’ve rarefied the data, we can look at how the samples group.
mergedlog <- transform_sample_counts(pruned_data, function(x) log(1 + x))
out.wuf.log <- ordinate(mergedlog, method = "PCoA", distance = "euclidean")
plot_ordination(mergedlog, out.wuf.log, color="SamplePop", axes = 1:2, label="SampleName")
As before, we see PC1 separating out the samples by study. But, rather than those clusters of plasmodium coming out along PC2, now we see that it’s plasmodium AND Flaviviridae.
logged_phyla_counts = log10(colSums(otu_table(pruned_data)[grep("Apicomplexa",tax_table(pruned_data)[,"Phylum"])])+1)
sample_data(pruned_data)[["Apicomplexa"]] = logged_phyla_counts
mergedlog <- transform_sample_counts(pruned_data, function(x) log(1 + x))
out.wuf.log <- ordinate(mergedlog, method = "PCoA", distance = "euclidean")
print(plot_ordination(mergedlog, out.wuf.log, color="Apicomplexa", axes = 2:3, label="SampleName"))
logged_phyla_counts = log10(colSums(otu_table(pruned_data)[grep("Flaviviridae",tax_table(pruned_data)[,"Family"])])+1)
sample_data(pruned_data)[["Flaviviridae"]] = logged_phyla_counts
mergedlog <- transform_sample_counts(pruned_data, function(x) log(1 + x))
out.wuf.log <- ordinate(mergedlog, method = "PCoA", distance = "euclidean")
print(plot_ordination(mergedlog, out.wuf.log, color="Flaviviridae", axes = 2:3, label="SampleName"))
Instead of looking at the major sources of variance within samples, can also investigate the major sources of varaince by taxa. So that it isn’t so hard to visualise, we’ll separate each individual taxa so they’re easier to view.
The clustering comes out best when looking at PC3 vs PC4, which I’ll show here.
p1=plot_ordination(mergedlog, out.wuf.log, type="taxa", color="Phylum", title="taxa", axes = 3:4)
# separate this out by using facet_wrap
p1 + facet_wrap(~Phylum, 3)
Now that we’ve rarefied the data, we can look at the relative abundance of taxa for each sample. To visualise this, I want to show my taxa at the family level (see script X), however since there are nearly 200 unique taxa at the family level, this would be difficult to visualise with so many colours. Instead, we’ll make a new taxa variable which combines Superkingdom information with Family-level information. This way, we can highlight colours by superkingdom (which isn’t so visually overwhelming), and still preserve Family-level information.
# add a new column containing family names and superkingdom
tax_table(pruned_data)[,"Superkingdom"] = paste(tax_table(pruned_data)[,"Superkingdom"], tax_table(pruned_data)[,"Family"], sep="_")
tax_table(pruned_data)[,"Superkingdom"] <- gsub("Bacteria_$", "Bacteria_unclassified", tax_table(pruned_data)[,"Superkingdom"])
tax_table(pruned_data)[,"Superkingdom"] <- gsub("Eukaryota_$", "Eukaryota_unclassified", tax_table(pruned_data)[,"Superkingdom"])
tax_table(pruned_data)[,"Superkingdom"] <- gsub("Viruses_$", "Viruses_unclassified", tax_table(pruned_data)[,"Superkingdom"])
As pointed out, we have a lot of taxa at the family level (178 to be exact), and it would be hard to look over everything at once. Instead, we can focus on the top X number of taxa and highlight everything else in another colour.
Here, I chose to highlight the top 20 taxa, since thats still representative while not being too visually exhausting.
# For some reason, top is actually top + 1, so here it would be 20
aggregated_phyloCounts <- aggregate_top_taxa(pruned_data, "Superkingdom", top = 19)
# transform to relative counts
relative_phyloCounts <- microbiome::transform(aggregated_phyloCounts, "compositional")
# Remove weird extra family names added at the end of Superkingdom names
tax_table(relative_phyloCounts)[,"Superkingdom"] <- paste(sapply(strsplit(taxa_names(relative_phyloCounts), "[_.]"), `[`, 1), sapply(strsplit(taxa_names(relative_phyloCounts), "[_.]"), `[`, 2), sep="_")
# Change "Other_NA" to just "Other"
tax_table(relative_phyloCounts)[,"Superkingdom"][grep("Other", taxa_names(relative_phyloCounts))] = "Other"
# Plot
p=plot_bar(relative_phyloCounts, fill = "Superkingdom")
# set colour palette
families=levels(p$data$Superkingdom)
# get number of families in each kingdom
table(sapply(strsplit(families, "[_.]"), `[`, 1))
##
## Bacteria Eukaryota Other Viruses
## 16 2 1 1
PaletteBacteria = colorRampPalette(c("#023858","#74a9cf"))(16)
PaletteEukaryote = colorRampPalette(c("#fd8d3c","#800026"))(2)
PaletteOther = colorRampPalette(c("black"))(1)
PaletteVirus = colorRampPalette(c("#78c679","#006837"))(1)
Merged_Palette <- c(PaletteBacteria,PaletteEukaryote,PaletteOther,PaletteVirus)
phyloseq::plot_bar(relative_phyloCounts, fill = "Superkingdom") +
geom_bar(aes(fill = Superkingdom), stat = "identity", position = "stack") +
labs(x = "", y = "Relative Abundance\n") +
facet_wrap(~ SamplePop, scales = "free") + scale_fill_manual(values=Merged_Palette) +
theme(panel.background = element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())
We can see that the Indonesian samples have a lot more variability in the taxa they have within whole blood. # Boxplots
phyloseq::psmelt(pruned_data) %>%
ggplot(data = ., aes(x = SamplePop, y = Abundance)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(aes(color = Phylum), height = 0, width = .2) +
labs(x = "", y = "Abundance\n") +
facet_wrap(~ Phylum, scales = "free")
ps_rel_abund = phyloseq::transform_sample_counts(pruned_data, function(x){x / sum(x)})
ps_rel_otu <- data.frame(phyloseq::otu_table(ps_rel_abund))
ps_rel_otu <- t(ps_rel_otu)
bc_dist <- vegan::vegdist(ps_rel_otu, method = "bray")
ward <- as.dendrogram(hclust(bc_dist, method = "ward.D2"))
#Provide color codes
meta <- data.frame(phyloseq::sample_data(ps_rel_abund))
colorCode <- c(Indonesia = IndonesiaCol, `Netherlands` = NetherlandsCol)
labels_colors(ward) <- colorCode[meta$SamplePop][order.dendrogram(ward)]
#Plot
plot(ward)
# Differnetial abundance testing
You can also do a test on this to see if theyre diffrent.
conduct a multivariate test for differences in the overall composition between groups of samples. This type of test can be implemented using the HMP package (Xdc.sevsample function) described in the paper Hypothesis Testing and Power Calculations for Taxonomic-Based Human Microbiome Data by La Rosa et. al.
differential_abund=pruned_data
otu_table(differential_abund)=otu_table(differential_abund)+1
taxa_names(differential_abund)=make.unique(tax_table(differential_abund)[,"Phylum"])
diagdds = phyloseq_to_deseq2(differential_abund, ~ SamplePop)
# had help from this: https://help.galaxyproject.org/t/error-with-deseq2-every-gene-contains-at-least-one-zero/564/2
diagdds = DESeq(diagdds, test="Wald", fitType="parametric")
res = results(diagdds, cooksCutoff = FALSE)
alpha = 0.01
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(differential_abund)[rownames(sigtab), ], "matrix"))
head(sigtab)
## baseMean log2FoldChange lfcSE stat pvalue
## Actinobacteria.11 2.809784 -2.1871668 0.3080522 -7.099989 1.247670e-12
## Actinobacteria.21 1.447464 -0.8884540 0.2714890 -3.272523 1.065921e-03
## Actinobacteria.29 1.667682 -1.1940024 0.3363450 -3.549934 3.853284e-04
## Actinobacteria.88 18.911925 -5.1898070 0.3829265 -13.553012 7.604484e-42
## Actinobacteria.95 3.449592 -2.5417807 0.3732551 -6.809769 9.775564e-12
## Actinobacteria.98 1.500033 -0.9673098 0.3217448 -3.006451 2.643169e-03
## padj Superkingdom Kingdom
## Actinobacteria.11 1.386301e-11 Bacteria_unk_f unk_k
## Actinobacteria.21 3.553069e-03 Bacteria_Corynebacteriaceae unk_k
## Actinobacteria.29 1.427142e-03 Bacteria_Gordoniaceae unk_k
## Actinobacteria.88 2.851682e-40 Bacteria_Micrococcaceae unk_k
## Actinobacteria.95 9.775564e-11 Bacteria_Micrococcaceae unk_k
## Actinobacteria.98 7.929506e-03 Bacteria_Micrococcaceae unk_k
## Phylum Class Order
## Actinobacteria.11 Actinobacteria Actinobacteria Actinomycetales
## Actinobacteria.21 Actinobacteria Actinobacteria Corynebacteriales
## Actinobacteria.29 Actinobacteria Actinobacteria Corynebacteriales
## Actinobacteria.88 Actinobacteria Actinobacteria Micrococcales
## Actinobacteria.95 Actinobacteria Actinobacteria Micrococcales
## Actinobacteria.98 Actinobacteria Actinobacteria Micrococcales
## Family Genus
## Actinobacteria.11 unk_f unk_g
## Actinobacteria.21 Corynebacteriaceae Corynebacterium
## Actinobacteria.29 Gordoniaceae Gordonia
## Actinobacteria.88 Micrococcaceae Arthrobacter
## Actinobacteria.95 Micrococcaceae Micrococcus
## Actinobacteria.98 Micrococcaceae Micrococcus
## Species
## Actinobacteria.11 uncultured Actinomycetales bacterium
## Actinobacteria.21 uncultured Corynebacterium sp.
## Actinobacteria.29 Gordonia sp. KTR9
## Actinobacteria.88 Arthrobacter sp. JEK-2009
## Actinobacteria.95 Micrococcus luteus
## Actinobacteria.98 Micrococcus sp. ALBL_223
library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
scale_fill_brewer(palette = palname, ...)
}
# Phylum order
x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
# Genus order
x = tapply(sigtab$log2FoldChange, sigtab$Genus, function(x) max(x))
x = sort(x, TRUE)
#sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
ggplot(sigtab, aes(x=log2FoldChange, y=Phylum, color=Phylum)) + geom_point(size=6) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
geom_vline(xintercept = 0.0, color = "gray", size = 0.5)
alpha diversity does these things…
#Generate a data.frame with adiv measures
adiv <- data.frame(
"Observed" = phyloseq::estimate_richness(pruned_data, measures = "Observed"),
"Shannon" = phyloseq::estimate_richness(pruned_data, measures = "Shannon"),
"Simpson" = phyloseq::estimate_richness(pruned_data, measures = "Simpson"),
"Status" = phyloseq::sample_data(pruned_data)$SamplePop)
head(adiv)
## Observed Shannon Simpson Status
## MPI.025 6 0.2081923 0.07542963 Indonesia
## MPI.048 38 1.8709799 0.67514830 Indonesia
## MPI.061 8 0.1842899 0.06277341 Indonesia
## MPI.065 25 1.3447154 0.57840912 Indonesia
## MPI.296 37 0.7191799 0.22182733 Indonesia
## MPI.334 14 0.3058554 0.10278523 Indonesia
adiv %>%
gather(key = metric, value = value, c("Observed", "Shannon","Simpson")) %>%
mutate(metric = factor(metric, levels = c("Observed", "Shannon","Simpson"))) %>%
ggplot(aes(x = Status, y = value)) +
geom_boxplot(outlier.color = NA) +
geom_jitter(aes(color = Status), height = 0, width = .2) +
labs(x = "", y = "") +
facet_wrap(~ metric, scales = "free") +
theme(legend.position="none") + scale_colour_manual(values=c(IndonesiaCol,NetherlandsCol))
adiv %>%
group_by(Status) %>%
dplyr::summarise(median_observed = median(Observed),
median_shannon = median(Shannon),
median_simpson = median(Simpson))
## # A tibble: 2 x 4
## Status median_observed median_shannon median_simpson
## <fct> <dbl> <dbl> <dbl>
## 1 Indonesia 59 2.84 0.875
## 2 Netherlands 22 1.63 0.732
wilcox.test(Observed ~ Status, data = adiv, exact = FALSE, conf.int = TRUE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Observed by Status
## W = 1985, p-value = 5.329e-09
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 27.99996 43.00006
## sample estimates:
## difference in location
## 36.99999
wilcox.test(Shannon ~ Status, data = adiv, conf.int = TRUE)
##
## Wilcoxon rank sum test
##
## data: Shannon by Status
## W = 1646, p-value = 0.0005915
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 0.8121811 1.3925959
## sample estimates:
## difference in location
## 1.18072
wilcox.test(Simpson ~ Status, data = adiv, conf.int = TRUE)
##
## Wilcoxon rank sum test
##
## data: Simpson by Status
## W = 1530, p-value = 0.01032
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 0.07078411 0.15843338
## sample estimates:
## difference in location
## 0.1287158
Beta diversity does this…
ps_clr <- microbiome::transform(pruned_data, "clr")
# RDA without constraints is PCA
ord_clr <- phyloseq::ordinate(ps_clr, "RDA")
clr1 <- ord_clr$CA$eig[1] / sum(ord_clr$CA$eig)
clr2 <- ord_clr$CA$eig[2] / sum(ord_clr$CA$eig)
phyloseq::plot_ordination(pruned_data, ord_clr, type="samples", color="SamplePop",label = "SampleName") +
geom_point(size = 2) +
coord_fixed(clr2 / clr1) +
stat_ellipse(aes(group = SamplePop), linetype = 2) + scale_colour_manual(values=c(IndonesiaCol,NetherlandsCol))
ps4.rel <- microbiome::transform(pruned_data, "compositional")
bx.ord_pcoa_bray <- ordinate(ps4.rel, "PCoA", "bray")
# Make an ordination plot using bray's dissimilarity
beta.ps1 <- plot_ordination(ps4.rel,
bx.ord_pcoa_bray,
color="SamplePop",
label = "SampleName") +
geom_point(aes(), size= 4) +
theme(plot.title = element_text(hjust = 0, size = 12))
# add in an ellipse
beta.ps1 + stat_ellipse() + theme_bw(base_size = 14) +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
scale_colour_manual(values=c(IndonesiaCol,NetherlandsCol))
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure